Teacher-student approach for lung tumor segmentation from mixed-supervised datasets

Purpose Cancer is among the leading causes of death in the developed world, and lung cancer is the most lethal type. Early detection is crucial for better prognosis, but can be resource intensive to achieve. Automating tasks such as lung tumor localization and segmentation in radiological images can free valuable time for radiologists and other clinical personnel. Convolutional neural networks may be suited for such tasks, but require substantial amounts of labeled data to train. Obtaining labeled data is a challenge, especially in the medical domain. Methods This paper investigates the use of a teacher-student design to utilize datasets with different types of supervision to train an automatic model performing pulmonary tumor segmentation on computed tomography images. The framework consists of two models: the student that performs end-to-end automatic tumor segmentation and the teacher that supplies the student with additional pseudo-annotated data during training. Results Using only a small proportion of semantically labeled data and a large amount of bounding box annotated data, we achieved competitive performance using a teacher-student design. Models trained on larger amounts of semantic annotations did not perform better than those trained on teacher-annotated data. Our model trained on a small amount of semantically labeled data achieved a mean Dice similarity coefficient of 71.0 on the MSD Lung dataset. Conclusions Our results demonstrate the potential of utilizing teacher-student designs to reduce the annotation load, as less demanding annotation schemes may be used without any real degradation in segmentation accuracy.


Introduction
Cancer is becoming the leading cause of death and the most significant obstacle to increasing life expectancy in many countries [1]. Lung cancer, accounting for more than 11% of all new cases, is the second most common cancer and ranks first in cancer-related mortality worldwide, accounting for 18% of all cancer deaths [2]. The most common lung cancer treatments include surgical resection, chemotherapy, radiotherapy, and immunotherapy. Many of these treatments, as well as successful diagnosis with bronchoscopy or computed tomography (CT)-guided biopsy, depend on accurately locating, and in many cases delineating (segmenting), the tumor from normal tissue in the preoperative images, typically CT.
Manual segmentation of lesions or tumors from preoperative CT is a laborious and tedious process for oncologists, radiologists, and pulmonologists, which could delay treatment and lower survival rates, especially in clinics with inadequate resources. In addition, the quality of manual localization and segmentation relies on prior knowledge and clinical expertise. Even with adequate guidelines and standards, tumor segmentation is often prone to high inter- and intraobserver variability. Automatic segmentation techniques, on the other hand, have the potential to provide efficient, consistent, and more accurate results. Automatic methods can shorten the time needed to read the images and allow experts to devote their limited time to optimizing treatment planning.

Related work
Historically, methods like thresholding [3], region growing [4,5], and graph cuts [6] were commonly proposed to segment lung tumors from CT images. These algorithms are suitable as semi-automatic methods, but are not suited for localization of lung tumors. Recent advancements in deep learning enable automation of tasks that until recently were only performed by trained experts [7,8]. Advancements in hardware have enabled development of larger and increasingly complex models, but much of the improvement is caused by access to large amounts of annotated data.
Today, especially after the introduction of the U-Net [9] architecture, deep learning methods dominate the field of medical image segmentation [10]. However, convolutional neural networks (CNNs) are memory intensive, especially for 3D volumes. It is therefore common to train networks on 2D or 2.5D input images [11], where the model evaluates one slice at a time, chunks of slices, or 3D patches, and is then applied in a sliding window fashion across the CT volume. Patching the 3D volume comes at the cost of a reduced field of perception, and thus more efficient multi-scale 3D CNN architectures have been proposed, which enable the use of larger input volumes [12]. An alternative approach is to perform segmentation in multiple steps, either using multiple algorithms or a cascade of CNNs [11,13,14,15].
To accommodate the lack of training data, unsupervised methods such as supervoxel-based approaches have been proposed [16]. To facilitate faster convergence and more accurate results, multimodality methods that utilize magnetic resonance imaging (MRI) or positron emission tomography (PET) scans in addition to CT have been suggested [17][18][19]. Neural network architectures that can utilize multiple annotation types have also been suggested [20,21].
A more recent strategy to accommodate the lack of training data is the teacher-student design, inspired by the concept of knowledge distillation [22][23][24]. The teacher creates pseudo-annotations from suboptimal annotations to increase the dataset size for training the student. The teacher-student pattern can be applied to any type of network architecture, and does not dictate other hyperparameters or external configurations. A teacher-student design can be used in different ways, from utilization of unlabeled data [25][26][27], to exploitation of multiple modalities [18,19,27], and to usage of datasets with different annotation types [28,29].

Contributions
Our approach differs from the previously mentioned methods applied to lung tumor segmentation by utilizing annotations of different supervision on the same modality, namely CT images. Since CT scanning is less invasive to the patient than the other modalities, it is a goal to efficiently segment tumors from CT-only examinations. Our method is inspired by Sun et al. [28], who show promising results using a teacher-student framework to segment liver and liver lesions given semantic and bounding box annotations. To the best of our knowledge, we are the first to implement a similar teacher-student framework for CT images to perform semantic segmentation of lung tumors. Our study suggests that even with a small dataset of semantic annotations, a student can achieve state-of-the-art performance given a large enough pseudo-annotated dataset to learn from.

Materials and methods
Our method consists of two separate models: a semi-supervised teacher and a fully automatic student. The method relies on two different annotation types: semantic 3D annotations and 2D bounding boxes in the axial planes. We refer to these as strong and weak annotations, respectively. Furthermore, we define our strongly and weakly annotated datasets as D_s and D_w, respectively. An overview of our design can be seen in Fig 1.

Data
To study the effect of our teacher-student framework, we used three public datasets: Medical Segmentation Decathlon (MSD)-Lung [30], Non-Small Cell Lung Cancer (NSCLC)-Radiomics [31,32], and Lung-PET-CT-Dx [33]. All three datasets contain manual annotations by human experts. The first two datasets consist of semantic annotations, whereas the last contains bounding box labels annotated in the axial plane. The MSD-Lung dataset contains 64 images, whereas the NSCLC-Radiomics and Lung-PET-CT-Dx datasets contain 422 and 1295 images, respectively. Multiple images in the Lung-PET-CT-Dx dataset were discarded. The discarded images were either PET or PET/CT-fused, contained only a small portion of the thorax, or consisted of multiple scans stacked on top of each other. After removing all non-CT images and images with a real-world length (Z-axis) outside the range [16,60] cm, 665 images from Lung-PET-CT-Dx remained in our dataset.
The three datasets varied in terms of voxel density and tumor sizes. Overall, the Lung-PET-CT-Dx and the NSCLC-Radiomics datasets contain larger tumors than the MSD-Lung dataset (see Table 1). The tumor diameter is an approximate size, measured by calculating the average of the longest and shortest diameter of the tumor in real-world coordinate space.
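The approximate diameter described above can be illustrated with a minimal sketch. It assumes that the longest and shortest diameters are taken from the axis-aligned extents of the tumor mask in millimeters; the text does not state how the two diameters were measured, so both this choice and the function name are assumptions.

```python
def approximate_diameter_mm(voxel_coords, spacing_mm):
    """Average of the longest and shortest axis-aligned tumor extent, in mm.

    voxel_coords: list of (x, y, z) voxel indices belonging to the tumor.
    spacing_mm:   (sx, sy, sz) voxel spacing in millimeters.
    Assumption: extents are measured along the image axes.
    """
    extents = []
    for axis in range(3):
        values = [coord[axis] for coord in voxel_coords]
        # +1 because a single voxel still spans one voxel width.
        extents.append((max(values) - min(values) + 1) * spacing_mm[axis])
    return (max(extents) + min(extents)) / 2.0


# Toy tumor spanning 10 x 5 x 2 voxels at 1 x 1 x 1.5 mm spacing.
diameter = approximate_diameter_mm([(0, 0, 0), (9, 4, 1)], (1.0, 1.0, 1.5))
```

Here the extents are 10 mm, 5 mm, and 3 mm, so the approximate diameter is the mean of 10 and 3.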

Preprocessing
Our preprocessing pipeline consisted of multiple steps. Firstly, the voxel intensities were clipped to the range [-1024, 1000], before being standardized using Z-score normalization. The voxel spacing of the images was then normalized to an anisotropic resolution of 1 × 1 × 1.5 mm³. Lastly, a volumetric cropping was applied, which differed between the teacher and the student.
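The first two preprocessing steps can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the pipeline used in the study: the mean and standard deviation are computed per image over all voxels, the image is represented as a flat list of intensities, and the function names are hypothetical. The second function only computes the grid size after resampling, not the interpolation itself.

```python
from statistics import mean, pstdev


def clip_and_standardize(intensities, low=-1024.0, high=1000.0):
    """Clip intensities to [low, high], then apply Z-score normalization."""
    clipped = [min(max(float(v), low), high) for v in intensities]
    mu = mean(clipped)
    sigma = pstdev(clipped)  # population standard deviation (assumption)
    if sigma == 0:
        return [0.0 for _ in clipped]  # constant image: nothing to scale
    return [(v - mu) / sigma for v in clipped]


def resampled_shape(shape, spacing_mm, target_spacing_mm=(1.0, 1.0, 1.5)):
    """Voxel grid size after resampling to the target spacing.

    The physical extent (voxels * spacing) is kept constant per axis.
    """
    return tuple(
        max(1, round(n * s / t))
        for n, s, t in zip(shape, spacing_mm, target_spacing_mm)
    )


normalized = clip_and_standardize([-2000, -1024, 1000, 3000])
new_shape = resampled_shape((512, 512, 100), (0.7, 0.7, 3.0))
```

Values outside the clipping range collapse onto its bounds before the statistics are computed, so extreme intensities do not dominate the normalization.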
For the teacher, the images were cropped around the tumor with a fixed resolution of 128 × 128 × 128 voxels, whereas for the student, the images were split in two, each cropped around one of the lungs. The lungs were automatically segmented using the lungmask command line tool [34], and used when performing cropping around the lungs. The ground truth label images were voxel normalized and cropped in a similar manner as their corresponding CT image.
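The fixed-size crop around the tumor used for the teacher can be sketched as an index computation. The centering and clamping strategy below is an assumption, since the text only states the crop size; the function name is hypothetical.

```python
def crop_bounds(center, volume_shape, crop_size=(128, 128, 128)):
    """Return (start, stop) index pairs for a fixed-size crop around a center.

    The crop is shifted, not shrunk, when it would cross the volume border.
    If an axis is shorter than the crop size, the stop index exceeds the
    volume and the caller would need to pad (assumption, not shown here).
    """
    bounds = []
    for c, n, k in zip(center, volume_shape, crop_size):
        start = c - k // 2
        start = max(0, min(start, n - k))  # clamp so the crop stays inside
        bounds.append((start, start + k))
    return bounds


# Tumor center near two borders of a 512 x 512 x 320 volume.
bounds = crop_bounds(center=(10, 256, 300), volume_shape=(512, 512, 320))
```

Shifting the window keeps the 128 × 128 × 128 input size constant for the network, at the cost of the tumor not always being centered.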

Teacher-student design
The teacher was trained on 3D patches surrounding the tumor, guided by the corresponding bounding box annotations. Once trained, the teacher was applied to D_w to generate pseudo-strong labels, D_w′. Although expert labeled images are the gold standard, teacher pseudo-annotated images can enhance training of fully automatic models, or even be used to aid experts in clinical use. The student, like any ordinary automatic method, takes CT images as input and produces 3D segmentations of the potential lung tumors without user interaction. During training, the student exploits the pseudo-annotated images in D_w′ produced by the teacher, using the extended dataset, {D_s, D_w′}. Once trained, the student can perform end-to-end segmentation without human intervention. Algorithm 1 describes the training scheme. W_x and S_x denote the inputs of the weakly labeled and strongly labeled datasets, respectively. Likewise, W_y and S_y denote the weak and strong annotations.
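The two-stage scheme described above can be sketched in a framework-independent way. This is a minimal illustration, not the training code of the study: the teacher and the student update step are passed in as hypothetical callables, and the toy stand-ins operate on 1D lists instead of CT volumes.

```python
def pseudo_annotate(teacher, weak_inputs, weak_boxes):
    """Apply a trained teacher to D_w and return D_w' as (input, label) pairs."""
    return [(x, teacher(x, box)) for x, box in zip(weak_inputs, weak_boxes)]


def train_student(student_step, strong_pairs, pseudo_pairs, epochs=1):
    """Train the student on the extended dataset {D_s, D_w'}.

    student_step(x, y) is a hypothetical callable that predicts on x,
    computes the loss against label y, updates the student, and returns
    the loss value.
    """
    extended = list(strong_pairs) + list(pseudo_pairs)
    losses = []
    for _ in range(epochs):
        for x, y in extended:
            losses.append(student_step(x, y))
    return losses


# Toy stand-ins (assumptions): a 1D "image", a box given as (start, stop),
# and a teacher that labels bright voxels inside the box.
def toy_teacher(image, box):
    return [1 if box[0] <= i < box[1] and v > 4 else 0 for i, v in enumerate(image)]


def toy_student_step(image, label):
    # Placeholder value standing in for a real loss: the labeled fraction.
    return sum(label) / len(label)


strong = [([0, 9, 9, 0], [0, 1, 1, 0])]                          # D_s
pseudo = pseudo_annotate(toy_teacher, [[0, 5, 9, 1]], [(1, 3)])  # D_w'
losses = train_student(toy_student_step, strong, pseudo, epochs=2)
```

The key design point is that the teacher is applied once, after its own training, so the student sees the pseudo-labels as if they were ordinary strong annotations.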
Algorithm 1. Training scheme (fragment)
⊳ Store input and teacher-annotated output in …
⊳ Prediction on image
8: loss ← calculate_loss(y, y′) ⊳ Calculate loss from output and label
9: student.adjust_weights(loss) ⊳ Backpropagate after every batch

Implementation
All our networks are based on the U-Net architecture [9] and share common building blocks (see Fig 2). U-Net was used because it performs well as a baseline architecture and has shown competitive performance on various datasets from different modalities, organs, cancer types, and data types [14,35]. The teacher consists of three levels, one for each downsampling operation, going from an image resolution of 128 × 128 × 128 to 16 × 16 × 16. The students comprise four levels. In contrast to the original U-Net architecture, our design performs downsampling by applying 3D convolutions with a stride of two. We also substituted the ReLU [36] activation function with PReLU [37].
We implemented two related students: one that produces semantic segmentation output only, which we call the Single Output Student (SO Student), and one that produces an additional output approximating the bounding box surrounding the tumor in the axial plane, which we call the Dual Output Student (DO Student). The architecture of the student networks can be seen in Fig 2.
The original U-Net design was too heavy to be applied in 3D directly. The architecture was therefore tuned to be better suited for the task and dataset. Hyperparameters were chosen through a systematic but limited search, as a rigorous search was not feasible due to the long training runtime. The teacher architecture and training hyperparameters were then frozen before the teacher-student design was introduced. This was done to make the comparison between the designs fair. The SO Student was one level deeper than the teacher, but trained in the same manner.
The DO student was identical to the SO student, but a second decoder branch was added to investigate the potential benefit of using both annotation types (semantic and bounding box labels) in training. The second decoder branch had a second loss to predict bounding boxes. The aim of the second branch was to improve localization of the tumor, as using the bounding box labels would make it learn different features.
The Adam [38] optimizer with a learning rate of 10⁻⁴ was used for training until the validation DSC converged. The batch size was set to one and virtually increased to eight using accumulated gradients. The gradients were computed using the Dice Loss function [39], based on the Dice Similarity Coefficient (DSC). The models were trained for a maximum of 350 epochs, or until overfitting occurred. The best model was selected based on the lowest validation loss.
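Two of the training ingredients above, the Dice loss and gradient accumulation, can be illustrated with a small sketch. Both functions are simplifications under stated assumptions: the loss is one common soft Dice form on flattened probabilities (the cited formulation [39] may differ, for example by squaring terms in the denominator), and the update uses plain gradient descent on a single scalar weight instead of Adam. The names `dice_loss`, `accumulate_and_step`, and `grad_fn` are hypothetical.

```python
def dice_loss(probabilities, target, eps=1e-6):
    """Soft Dice loss, 1 - DSC, on flattened predictions and binary labels."""
    intersection = sum(p * t for p, t in zip(probabilities, target))
    denominator = sum(probabilities) + sum(target)
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)


def accumulate_and_step(weight, samples, grad_fn, learning_rate=1e-4,
                        accumulation_steps=8):
    """Emulate a batch of `accumulation_steps` with a real batch size of one.

    Gradients of single samples are averaged and the weight is updated only
    after every `accumulation_steps` samples. Plain gradient descent is used
    here as a stand-in for Adam (assumption).
    """
    accumulated = 0.0
    for step, sample in enumerate(samples, start=1):
        accumulated += grad_fn(weight, sample) / accumulation_steps
        if step % accumulation_steps == 0:
            weight -= learning_rate * accumulated
            accumulated = 0.0
    return weight


perfect = dice_loss([1.0, 0.0, 1.0, 0.0], [1, 0, 1, 0])
partial = dice_loss([1.0, 1.0, 0.0, 0.0], [1, 0, 0, 0])
updated = accumulate_and_step(1.0, range(8), lambda w, s: 1.0)
```

With eight accumulated samples and a constant gradient of one, exactly one update of size 10⁻⁴ is applied, which matches what a true batch of eight would produce.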

Empirical evaluation
To evaluate our framework we considered two primary scenarios, each with two sub-experiments. We considered one scenario where the size of the strongly annotated dataset (~500 images) is similar to the size of the weakly annotated dataset (~750 images), and another scenario where the strongly annotated dataset was considerably smaller (~50 images) than the weakly annotated dataset (~1000 images). Within each scenario, we evaluated two semi-supervised models and three fully automatic models. Among the three fully automatic models, one model was trained solely on strongly annotated data, whereas the two others were student networks trained on both strongly annotated data and the teacher-annotated pseudo labels.
We used different metrics for evaluating and comparing the models. The DSC was used to measure the semantic segmentation performance, whereas the F1-score was used to determine object-wise localization performance. We also used DSC-TP to evaluate the segmentation accuracy considering only true positives (TPs). We considered objects to be true positives if there was ≥ 25% overlap between the predicted mask and the ground truth (GT) mask, motivated by a prior study [40].
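The metrics above can be sketched on objects represented as sets of voxel coordinates. Two points are assumptions and not taken from the text: overlap is measured here as intersection over union, and each ground truth object is matched to at most one prediction. All function names are hypothetical.

```python
def dsc(pred, truth):
    """Dice similarity coefficient between two sets of voxel coordinates."""
    if not pred and not truth:
        return 1.0
    return 2.0 * len(pred & truth) / (len(pred) + len(truth))


def overlap_fraction(pred, truth):
    """Overlap between two objects, measured as IoU (assumption)."""
    union = pred | truth
    return len(pred & truth) / len(union) if union else 0.0


def object_f1(pred_objects, truth_objects, threshold=0.25):
    """Object-wise F1-score with a minimum-overlap criterion for true positives."""
    matched = set()
    tp = 0
    for pred in pred_objects:
        for index, truth in enumerate(truth_objects):
            if index not in matched and overlap_fraction(pred, truth) >= threshold:
                matched.add(index)
                tp += 1
                break
    fp = len(pred_objects) - tp
    fn = len(truth_objects) - tp
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 1.0


truth = [{(0, 0, 0), (0, 0, 1), (0, 0, 2), (0, 0, 3)}]
preds = [{(0, 0, 0), (0, 0, 1)}, {(5, 5, 5)}]
f1 = object_f1(preds, truth)
dsc_tp = dsc(preds[0], truth[0])  # DSC computed on the true positive only
```

In this toy case one prediction overlaps the tumor by 50% and counts as a true positive, while the second is a false positive, giving an F1-score of 2/3.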
The test set was sampled at random and accounted for 15% of the total dataset. The same split was used for all experiments to preserve fairness in evaluation. Patients with multiple scans were stratified into the three subsets: train, validation, and test. To counter the tumor size imbalance, we balanced the train and validation sets with regard to tumor sizes. Images containing tumors of rarer sizes were upsampled.
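The upsampling of rare tumor sizes can be illustrated as follows. The text does not specify the size bins or the amount of upsampling, so the bin edges, the target of equal bin counts, and the function name are all assumptions made for this sketch.

```python
import random
from collections import defaultdict


def upsample_rare_sizes(samples, bin_edges_mm=(20.0, 40.0), seed=0):
    """Duplicate images from rare tumor-size bins until all bins are equal.

    samples: non-empty list of (image_id, diameter_mm) tuples.
    bin_edges_mm: hypothetical size thresholds separating the bins.
    """
    rng = random.Random(seed)
    bins = defaultdict(list)
    for sample in samples:
        # Bin index = number of edges the diameter reaches or exceeds.
        bin_index = sum(sample[1] >= edge for edge in bin_edges_mm)
        bins[bin_index].append(sample)
    target = max(len(members) for members in bins.values())
    balanced = []
    for members in bins.values():
        balanced.extend(members)
        # Sample with replacement to fill the gap to the largest bin.
        balanced.extend(rng.choices(members, k=target - len(members)))
    return balanced


balanced = upsample_rare_sizes(
    [("a", 10.0), ("b", 12.0), ("c", 15.0), ("d", 30.0), ("e", 50.0)]
)
```

Such balancing would be applied to the train and validation sets only, so that the test set keeps its natural size distribution.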
Models were trained using a workstation with a 14-core Intel Core i9 10940X @3.30 GHz CPU, 128 GB RAM, and two NVIDIA RTX 8000 (48 GB) GPUs. The most memory intensive student used, at its peak, ~22.54 GB VRAM during training, but inference can be performed with 3 GB VRAM. Implementation was done in Python 3.7, built upon the MONAI [41] framework (v0.4.0), using PyTorch v1.6 and CUDA 11.0. The best performing model and corresponding inference code are made openly available as a command line tool at https://github.com/VemundFredriksen/LungTumorMask.

Vast strongly annotated dataset
As seen in Table 2, the teacher guided by the bounding boxes outperformed the point guided teacher (without bounding boxes as input) on both datasets in terms of DSC. The difference between the two models was less prominent on the MSD-Lung dataset than on the NSCLC-Radiomics dataset.
For the final inference models, the DSC was highest on the MSD-Lung dataset across all three models (see Table 3). The best performing student network overall was the SO Student, with the highest DSC on the MSD-Lung dataset. There was negligible difference between the three models on the NSCLC-Radiomics dataset. The Baseline model performed best on the MSD-Lung dataset, both in terms of DSC and F1-score.

Scarce strongly annotated dataset
When the strongly labeled dataset was reduced, the performance of the point guided teacher degraded, whereas the box guided teacher still performed well (see Table 4). A similar trend applies to the final inference models (see Table 5). The baseline model performed worse, whereas the student networks still performed well. The same can be seen from the object-wise metrics, where the difference was even more prominent. Contrary to the results shown in Table 3, the SO Student had the highest DSC measured in this scenario. Fig 3 shows a sample of the outputs produced by the models in the scarce scenario.

Discussion
In this paper, a teacher-student design to segment lung tumors from CTs has been proposed. Three datasets of two different annotation types were used for this purpose. The teacher model was first trained on the datasets that had strong annotations. It was then used to generate pseudo-strong annotations for the student. Both the teacher and the student used U-Net-like architectures, and were evaluated on segmentation performance. In addition, the student networks were evaluated on sensitivity to annotation type and sample size.
We observed that the box guided teacher outperformed the point guided teacher in both scenarios. This was expected, as the bounding box annotations assist the teacher by serving as a segmentation and localization constraint. The effect of the box guidance is especially visible in the scarce scenario, where the box guided teacher achieved almost double the DSC of the point guided one. The scarce box guided teacher also outperformed the vast point guided teacher. This suggests that training a teacher on a smaller set of bounding box annotated images can be advantageous compared to training a teacher on a large set of point guided images.
Surprisingly, the students did not perform better than the baseline in the scenario with vast strongly annotated data (see Table 3). Measured on the MSD-Lung dataset, the baseline model outperformed the two students, whereas the opposite was observed for the NSCLC-Radiomics dataset. A potential explanation might be that the Lung-PET-CT-Dx dataset contains tumors with sizes more similar to those in the NSCLC-Radiomics dataset than to those in MSD-Lung. The introduction of the Lung-PET-CT-Dx dataset may have led the students to perform better on larger tumors, but may have degraded the results on the smaller tumors typically found in MSD-Lung. Another explanation might be that the ratio between strong and weak labels was not large enough to make a noticeable difference. This was further supported when the models were evaluated in the scarce scenario (see Table 5). In this scenario, the students substantially outperformed the baseline supervised model. This demonstrates that the introduction of suboptimal annotations into the teacher-student design can improve performance of an end-to-end segmentation model.
Table 4 note: The best performing model with respect to mean Dice similarity coefficient (DSC) is highlighted in bold. https://doi.org/10.1371/journal.pone.0266147.t004

State-of-the-art comparison
We observed a DSC comparable with state-of-the-art performance on the MSD-Lung dataset, with an F1-score of 85.18 and a DSC of 71.00 for one of our students. Isensee et al. [14] reported a DSC of 69.2 on the MSD-Lung dataset, whereas Carvalho et al. [11] reported a DSC of 70.9. Our model trained on only 40 human annotated images scored marginally better, although on a different test set. Other state-of-the-art results demonstrated better performance on the radiomics dataset, for example Pang et al. [42]. However, all of these related works performed considerable data sanitization, which we did not, making the comparison unfair.
Furthermore, our results suggest that 40 images is not enough to train a supervised model, but is enough to train a semi-supervised model that can enhance a supervised model by increasing the available data in a cheaper way than manual delineation. This finding highlights the advantage of using a teacher-student design, such as ours, that can utilize datasets with poorer annotations. It is considerably faster to annotate tumors with bounding boxes than with semantic segmentation, and in our experiments this came with negligible loss in performance. This suggests that it is advantageous to spend the time annotating more images with poorer supervision than to spend the same amount of time annotating fewer images with higher quality. The other highlighted papers performing lung tumor segmentation cannot take advantage of this effect, as they rely solely on fully supervised training on high quality data.

Data noise argument
The datasets were of varying quality. The MSD-Lung dataset was of a high standard, whereas the NSCLC-Radiomics dataset was less so. Other publications that used the NSCLC-Radiomics dataset reported heavy data sanitization, effectively removing large parts of the dataset [13,42,43]. We did not override the experts' annotations, as we also seek to handle suboptimal annotations, should these be present in a dataset. The flawed dataset explains why the difference between the box guided and point guided teacher is larger on the NSCLC-Radiomics dataset than on MSD-Lung. For images where the tumor is poorly, or even completely wrongly, annotated, the box guided teacher can rely on the bounding boxes to achieve a good DSC, whereas the point guided teacher struggles because the annotation itself is wrong.

Limitations
One of the major limitations in this experiment was the scarce amount of data. The test set was sampled randomly from each dataset. It is plausible that a different sample of the test set would have given a different result. Although K-fold cross validation could be used to eliminate this concern, it was dropped due to time limitations. K-fold cross validation is a time consuming strategy. It depends on training K different models, which would take a considerable amount of time in our situation, even with a small K. Since our method relies on two training steps, K-fold cross validation would also take nearly double the time of a similar single-step method.
Another limitation of this experiment was that the students were sensitive to voxel spacing. When the voxel spacing was reduced during preprocessing, thus increasing the resolution of the image, the DSC did not improve, but actually degraded. Therefore, it is possible that the proposed architecture is sensitive to small adjustments in the preprocessing pipeline.

Future work
An alternative to using 2D bounding boxes in the axial plane is to use a 3D bounding box. As one 3D bounding box contains far fewer corners than multiple 2D bounding boxes, this could further reduce the annotation load. There is reason to believe that a teacher trained on 3D boxes will perform worse than one trained on 2D axial boxes. However, if the reduction in annotation load is significant, the amount of data that can be annotated for the teacher might outweigh the loss in annotation precision. After all, this is the fundamental idea behind the teacher-student design. However, we feel that a much larger dataset should be used to explore this properly.
The main motivation for using a teacher-student design is to improve models by learning from additional suboptimally annotated or unannotated data. We observed a benefit of using such a design for lung tumor segmentation in CTs. However, a single-step teacher might not be sufficient. It has been proposed to train both the teacher and student end-to-end in an iterative fashion [25]. This makes sense, as the teacher could improve from the student's feedback, especially from multiple students, which in turn could iteratively improve the students as the teacher becomes more experienced. However, for 3D applications this is likely infeasible. Alternatively, one could potentially use multiple teachers, trained on different types of images that focus on different lung tumor types and sizes. Having specialized teachers train the student in an ensemble manner makes sense, as it more closely represents the natural teacher-student relation from academia.

Conclusion
We present the first known implementation of a mixed-supervised teacher-student framework for lung tumor segmentation from CT images. Our method utilized both semantic and axial bounding box annotations to maximize lung tumor segmentation performance. We demonstrated that with sufficient bounding box annotated data, our teacher-student framework achieved state-of-the-art performance, even with scarce semantically annotated data. In a scenario with only 40 semantically labeled images and ~1000 bounding box labeled images, one of our models reached a mean DSC of 71.0 measured on nine images from the MSD dataset.